The problem is to determine how large a random sample is needed in
order to attain a preassigned probability $P^*$ ($\tfrac{1}{2} \le P^* < 1$) that the sample
will possess a certain amount (or degree) of representativeness of the true
unknown (cumulative) distribution $F$ under study. The definition of representativeness
involves two preassigned constants $k$ and $\beta^*$ ($k \ge 2$ is an
integer). For example, for $k = 2$ and any $\beta^*$ ($0 < \beta^* \le \tfrac{1}{2}$) the sample
is defined to be representative if the proportion of the total sample size falling
on each side of the population median differs from $\tfrac{1}{2}$ by at most $\beta^*$.
In this case the degree of representativeness is defined as $d_g^* = 1 - 2\beta^*$.

This idea can be extended to any number k of disjoint, exhaustive cells 
equi-probable under F; tables and graphs are given for finite and infinite 
populations for selected values of $k$, $\beta^*$ and $P^*$. The definition is also
extended to cases in which the experimenter is particularly interested in 
parts of F which are not equi-probable and/or parts of F which do not ex- 
haust the whole sample space; tables and graphs accompany each applica- 
tion. 

These results are non-parametric, i.e., if the prescribed sample size is 
used then the experimenter's requirements for representativeness will be 
satisfied whatever the unknown distribution. Derivations of exact and ap- 
proximate formulae used in computing tables are given in the Appendices. 

I. INTRODUCTION 

This paper deals with the problem of determining how large a random 
sample is needed in order to guarantee with preassigned probability P* 
that the sample will have a specified amount (or a specified degree) of 
representativeness of the true, unknown (cumulative) distribution F 
under study. No a priori information is given about F and no assumptions
are made about the form of F. The solution given is nonparametric (i.e., 
distribution-free) so that the results obtained and the tables and graphs 
constructed are valid for any true underlying distribution. The case of a
finite population as well as that of an infinite population is considered; 
in the latter case it is assumed only for ease of exposition that those 
percentiles of F which enter the discussion are uniquely defined and have 
probability zero under F. (This will, in particular, be the case when F 
has a density function without zero-stretches between points having 
positive density.) 

A definition of representativeness (and also a degree of representativeness)
is given with respect to those parts of $F$ which are between certain
percentiles which we denote by $F^{-1}(p_i)$, the values of $p_i$ being preassigned.
The intervals between these percentiles will be called cells and
we shall only consider collections of pairwise disjoint cells. For example
the experimenter may want to guarantee with probability at least
$P^* = 0.90$ that between 40 per cent and 60 per cent of his sample will
lie on each side of the population median. In this case we are interested
in the part of $F$ (or the cell) between $F^{-1}(0)$ and $F^{-1}(0.5)$ and also the
part of $F$ (or the cell) between $F^{-1}(0.5)$ and $F^{-1}(1)$. By the definitions
below the common allowance $\beta^*$ is 0.10 and the degree of representativeness
$d_g^*$ is 0.80 (or 80 per cent). Then we enter Table I (or II) with
$k = 2$, $P^* = 0.90$ and $\beta^* = 0.10$ and find that the smallest sample size
needed to satisfy the experimenter's requirement for representativeness
is $n = 60$. (It is instructive to note that the same solution would
hold for any two disjoint, exhaustive subsets of the sample space having
a common probability of $\tfrac{1}{2}$ under $F$. However, the cases in which we
consider disjoint cells and, in particular, disjoint cells which start
from one end or both ends of the distribution are of considerably more
practical interest. The cell terminology will be used in the body of
the paper while the subset terminology will be used in the appendices.)

In the above example the sample space is broken up into two dis- 
joint, exhaustive cells which are equi-probable under F. This idea of rep- 
resentativeness can be extended to any number k of pairwise disjoint, ex- 
haustive cells equi-probable under F and in the numerical work the values 
k = 2, 3, 4, 5 and 10 are considered. The idea of representativeness can 
also be used with cells that are not equi-probable and/or with cells that 
do not exhaust the whole sample space. As an example of the first type 
(cells not equi-probable) we might be concerned about whether a sample 
is large enough to be simultaneously representative of a single tail with 
preassigned probability $p < \tfrac{1}{2}$ under $F$ and of its complement which has
probability $(1 - p) > \tfrac{1}{2}$ under $F$. As an example of the second type
(non-exhaustive cells) we might be concerned about whether a sample
is large enough to be representative of both tails (each having (say) a
common preassigned probability $p < \tfrac{1}{2}$ under $F$), without any concern
about the middle cell between the two tails. For each problem tables 
and graphs throughout this paper give the smallest required sample 
size for selected values of P* and specified amounts (or specified degrees) 
of representativeness. 

Assuming for the moment that the density of F is known and that all 
of its deciles are finite then we can plot an observed bar diagram (i.e., 
rectangles with different widths under the dashed lines in Fig. 1) and
the true density on the same diagram as shown in Fig. 1 to illustrate the 
idea of a representative sample. By definition of a decile each of the ver- 
tical strips bounded above by the curve has an area (or probability under 
F) of 0.1. The observed sample is considered representative relative to 
this pattern of ten disjoint, exhaustive and equi-probable cells to within 
a common allowance $\beta^*$ if simultaneously the areas of all vertical rectangles
differ from the theoretical value of 0.1 by at most $\beta^*$ ($0 < \beta^* \le 0.1$).
Then the degree $d_g^*$ of representativeness as defined in Section III is
equal to $1 - 10\beta^*$. We are interested in finding the smallest sample size
needed to guarantee a probability of at least P* that the above condition 
will hold in a sample drawn at random from F. 

This problem is related to the well-known problem of Kolmogorov-Smirnov
(Reference 1) since they both have the common purpose of determining
the sample size required to obtain a representative sample. Since their 
definition of representativeness is different from the one treated here, 
it is difficult to make a proper comparison of the two procedures. Another 
remark on this comparison is made in Appendix IV. 

II. DEFINITION OF REPRESENTATIVENESS 

Let $F$ denote the true unknown cumulative distribution and let $F_n^*$ denote
the observed sample distribution based on $n$ observations. For any
given $k$ let $C_1, C_2, \ldots, C_k$ denote pairwise disjoint cells (not necessarily
exhaustive or equi-probable under $F$) which are defined by certain percentiles.
The cells $C_1, C_2, \ldots, C_k$ are not known but their probabilities
under $F$ are given positive numbers; let $F(C_i)$ denote the probability
assigned to $C_i$ by the distribution $F$ ($i = 1, 2, \ldots, k$). (We are using $F$
and $F_n^*$ as symbols for both point functions and probability measures
which are set functions; clearly, the nature of the argument will prevent
any confusion.) Let $\beta_i^*$ denote specified positive numbers (which we shall
call allowances) such that

$$ \beta_i^* \le F(C_i) \qquad (i = 1, 2, \ldots, k). \tag{1} $$



138 THE BELL SYSTEM TECHNICAL JOURNAL, JANUARY 1958 

We shall be particularly interested in the special case
$\beta_1^* = \beta_2^* = \cdots = \beta_k^* = \beta^*$ (say), whether or not the quantities
$F(C_i)$ are all equal. Then a sample is defined to be representative relative
to a fixed pattern of $k$ disjoint cells $C_1, C_2, \ldots, C_k$ to within the allowances
$\beta_1^*, \beta_2^*, \ldots, \beta_k^*$, respectively, if we have simultaneously

$$ |F_n^*(C_i) - F(C_i)| \le \beta_i^* \qquad (i = 1, 2, \ldots, k). \tag{2} $$
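
Condition (2) is a direct check on cell counts. The sketch below is one way to code that check; the function name `is_representative` and its argument layout are illustrative assumptions, not part of the paper. Exact rational arithmetic is used so that a proportion lying exactly on the allowance boundary is counted as representative.

```python
from fractions import Fraction


def is_representative(n, counts, cell_probs, allowances):
    """Check condition (2): |F_n*(C_i) - F(C_i)| <= beta_i* for every cell.

    n          -- total sample size (the cells need not be exhaustive)
    counts     -- number of observations falling in each cell C_i
    cell_probs -- the given probabilities F(C_i)
    allowances -- the allowances beta_i*
    """
    for count, prob, beta in zip(counts, cell_probs, allowances):
        observed = Fraction(count, n)      # sample proportion F_n*(C_i)
        target = Fraction(str(prob))       # F(C_i), read as an exact decimal
        if abs(observed - target) > Fraction(str(beta)):
            return False
    return True


# Median example from the Introduction: n = 60 and beta* = 0.10.
print(is_representative(60, [24, 36], [0.5, 0.5], [0.1, 0.1]))  # 40% / 60% split
print(is_representative(60, [23, 37], [0.5, 0.5], [0.1, 0.1]))  # 38.3% / 61.7% split
```

The sample size is passed separately from the counts because, for non-exhaustive cells, the counts do not add up to $n$.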



III. DEFINITION OF DEGREE OF REPRESENTATIVENESS 

Although the quantities $\beta_i^*$ ($i = 1, 2, \ldots, k$) are basic to the idea of
representativeness it may be useful, in a given problem, to combine them
to define a measure of the degree of representativeness. We define

$$ d_g^* = \left\{ \prod_{i=1}^{k} \left[ 1 - \frac{\beta_i^*}{F(C_i)} \right] \right\}^{1/k} \tag{3} $$

where the subscript $g$ denotes the fact that $d_g^*$ is a geometric mean. It
follows from (1) that $0 \le d_g^* < 1$ and that $d_g^*$ can take on all the values
in this interval.

It should be noted that for any fixed set of values of $F(C_i)$
($i = 1, 2, \ldots, k$) if there is a common $\beta^*$ then the right hand member of
(3) is a strictly decreasing function of $\beta^*$ for $\beta^* \le \min F(C_i)$. Hence, if
there is a common $\beta^*$ the values of $d_g^*$ and $\beta^*$ uniquely determine each
other. When this is the case we may be interested sometimes in specifying
$d_g^*$ (instead of $\beta^*$) and then using (3) to solve for the common $\beta^*$.

We shall say that a random sample is representative relative to a fixed
pattern of $k$ disjoint cells $C_1, C_2, \ldots, C_k$ to a degree $d_g^*$ if for the common
$\beta^* = \beta^*(d_g^*)$ satisfying (3) we have

$$ |F_n^*(C_i) - F(C_i)| \le \beta^* \qquad (i = 1, 2, \ldots, k). \tag{4} $$

It should be emphasized that the chief interest of this paper is in the 
concept of representativeness as formulated in Section II and that the 
present definition of the degree of representativeness is to be regarded 
as supplementary. 

One possible criticism of the definition of $d_g^*$ is that it may require a
positive (and sometimes substantial) number of observations to attain a
zero degree of representativeness (see, for example, the last and third
from last columns in Table III). However, since the practical use of the
concept of degree of representativeness is mainly for large values of $d_g^*$
this objection is not serious.

It is possible also to define the degree of representativeness as an
arithmetic mean $d_a^*$ of the bracketed quantities in (3) but then for a
common $\beta^*$ and different $F(C_i)$, because of (1), the value of $d_a^*$ is restricted
to an interval $J \le d_a^* < 1$ where $J$ is positive and depends on
the values of the $F(C_i)$ ($i = 1, 2, \ldots, k$). Clearly, if the $F(C_i)$ are all
equal and there is a common $\beta^*$ then $d_a^* = d_g^*$.
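
The two degree measures and the inversion from a specified degree back to a common allowance can be sketched as follows. This is a minimal illustration under assumptions: the function names are hypothetical, floating-point arithmetic is used, and the inversion relies on simple bisection, which is justified only because the geometric-mean degree is strictly decreasing in the common allowance.

```python
import math


def _bracketed_terms(cell_probs, allowances):
    """The quantities 1 - beta_i*/F(C_i); requires 0 < beta_i* <= F(C_i)."""
    for p, b in zip(cell_probs, allowances):
        if not 0 < b <= p:
            raise ValueError("each allowance must satisfy 0 < beta_i* <= F(C_i)")
    return [1.0 - b / p for p, b in zip(cell_probs, allowances)]


def degree_geometric(cell_probs, allowances):
    """Geometric mean of the bracketed terms (the degree d_g*)."""
    terms = _bracketed_terms(cell_probs, allowances)
    return math.prod(terms) ** (1.0 / len(terms))


def degree_arithmetic(cell_probs, allowances):
    """Arithmetic mean of the same terms (the alternative degree d_a*)."""
    terms = _bracketed_terms(cell_probs, allowances)
    return sum(terms) / len(terms)


def common_allowance(cell_probs, degree, tol=1e-12):
    """Solve d_g*(beta*) = degree for a common beta* by bisection."""
    k = len(cell_probs)
    lo, hi = 0.0, min(cell_probs)
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if degree_geometric(cell_probs, [mid] * k) > degree:
            lo = mid          # degree still too high, so beta* must grow
        else:
            hi = mid
    return (lo + hi) / 2.0


print(degree_geometric([0.5, 0.5], [0.1, 0.1]))    # median case, beta* = 0.10
print(degree_geometric([0.1, 0.9], [0.05, 0.05]))  # unequal cells
print(common_allowance([0.5, 0.5], 0.8))           # recover beta* from d_g*
```

For equal cell probabilities the two measures coincide; for unequal cells they differ, which is the distinction drawn in the paragraph above.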

IV. CONSTRUCTION OF TABLES 

The problem is to find the smallest sample size $n$ such that the joint
probability of all the inequalities (2) [or (4)] is at least equal to a specified
value $P^* < 1$, i.e., such that

$$ P\{\, |F_n^*(C_i) - F(C_i)| \le \beta_i^* \ (i = 1, 2, \ldots, k) \,\} \ge P^*. \tag{5} $$

The reader is cautioned that it does not necessarily follow that (5)
holds for every integer greater than $n$; however, since $F_n^*$ converges almost
certainly to $F$ (see page 20 of Reference 2), it follows that there exists
in each case a smallest number $n' \ge n$ such that (5) holds for every
integer greater than or equal to $n'$. For example, with $k = 2$, a common
$\beta^* = 0.20$ and $P^* = 0.75$ the condition (5) is satisfied for $n = 3$, for
6 and for any integer greater than or equal to $n' = 9$.
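
The non-monotone behaviour described above can be reproduced for the median case ($k = 2$, both cells with probability 1/2) with a short exact calculation. The sketch below is illustrative only: the function names and the search limit `n_max` are assumptions, and the value reported as $n'$ is verified only up to `n_max`.

```python
from fractions import Fraction
from math import comb


def prob_representative_median(n, beta):
    """Exact P{|X/n - 1/2| <= beta} for X ~ Binomial(n, 1/2)."""
    beta = Fraction(str(beta))
    half = Fraction(1, 2)
    favourable = sum(comb(n, x) for x in range(n + 1)
                     if abs(Fraction(x, n) - half) <= beta)
    return Fraction(favourable, 2 ** n)


def smallest_sizes(beta, p_star, n_max=200):
    """Return (n, n'): the first size meeting (5), and the size from which
    (5) holds for every checked size up to n_max."""
    p_star = Fraction(str(p_star))
    ok = [prob_representative_median(n, beta) >= p_star
          for n in range(1, n_max + 1)]
    n_first = ok.index(True) + 1
    last_fail = max((i + 1 for i, good in enumerate(ok) if not good), default=0)
    return n_first, last_fail + 1


for n in range(1, 13):
    print(n, float(prob_representative_median(n, 0.20)))
print(smallest_sizes(0.20, 0.75, n_max=60))
```

The probability rises and falls as $n$ increases because the set of admissible counts changes in jumps, which is why both $n$ and $n'$ are needed.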

Since the cells $C_i$ are pairwise disjoint and the values of $F(C_i)$ are given
($i = 1, 2, \ldots, k$) the left member of (5) is determined for any particular
sample size whatever the unknown distribution $F$. In the case of an
infinite population we use the multinomial distribution with $k$ or $k + 1$
disjoint cells depending on whether or not the $k$ disjoint cells are exhaustive,
i.e., on whether or not $\sum_{i=1}^{k} F(C_i) = 1$. For the case of two disjoint,
exhaustive cells this clearly reduces to a problem of the binomial
distribution which is closely related to the problem of finding confidence
limits on a population percentile by the use of order statistics. Similarly
in the case of a finite population we use the hypergeometric distribution
with $k$ or $k + 1$ categories depending on whether or not $\sum_{i=1}^{k} F(C_i) = 1$.
The exact and approximate formulae for computing the left member of
(5) are given in Appendices I and II, respectively. The approximate calculation
involves several interesting geometrical digressions which are
discussed in Appendix III.
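
For two non-exhaustive cells with a common probability $p$ (two tails), the infinite-population calculation uses a multinomial with $k + 1 = 3$ cells. The sketch below is a direct enumeration written for illustration; it is not the formula of Appendix I, and the function name is an assumption. Its output can be compared with the "Exact" rows of Table VI for $n = 10$.

```python
from fractions import Fraction
from math import ceil, comb, floor


def prob_two_tails(n, p, beta):
    """P{ n(p - beta) <= N1 <= n(p + beta) and the same for N2 }, where
    (N1, N2, n - N1 - N2) is multinomial with cell probabilities (p, p, 1 - 2p)."""
    p_exact, b_exact = Fraction(str(p)), Fraction(str(beta))
    lo = max(0, ceil(n * (p_exact - b_exact)))   # smallest admissible count
    hi = min(n, floor(n * (p_exact + b_exact)))  # largest admissible count
    tail = float(p_exact)
    middle = 1.0 - 2.0 * tail
    total = 0.0
    for i in range(lo, hi + 1):
        for j in range(lo, hi + 1):
            if i + j > n:
                continue
            ways = comb(n, i) * comb(n - i, j)   # multinomial coefficient
            total += ways * tail ** (i + j) * middle ** (n - i - j)
    return total


for p, beta in [(0.10, 0.05), (0.20, 0.05), (0.20, 0.10)]:
    print(p, beta, round(prob_two_tails(10, p, beta), 4))
```

The admissible count range is computed with exact fractions so that boundary values such as $n(p - \beta^*) = 0.5$ are rounded in the intended direction.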

Table I gives for $k = 2$ and selected values of $\beta^*$ and $P^*$ the required
sample sizes $n$ and $n'$ and also the maximum drop in probability below
the specified $P^*$ for all sample sizes between $n$ and $n'$. In the remaining
tables only the values of $n$ are given. Table II gives the required sample
size for $k = 2$, $F(C_1) = p$, $F(C_2) = 1 - p$ for $p = 0.5$, 0.2 and 0.1 (for
selected values of $\beta^*$ and $P^*$). Table III gives the required sample size
for the case of $k$ pairwise disjoint, exhaustive and equi-probable cells
($C_1, C_2, \ldots, C_k$) for $k = 2$, 3, 4, 5 and 10 (for selected values of $\beta^*$
and $P^*$). Table IV gives the required sample size for $k = 2$, $F(C_1) =
F(C_2) = p$ for $p = 0.2$, 0.1 and 0.05 (here the cells are disjoint and equi-probable
but not exhaustive). Table V considers the same problem as
in Table III and compares the required sample sizes for infinite populations,
$N = \infty$, with those for finite populations of size $N$ for $N = 60$,
120, 360. Tables VI and VII give illustrations of the error involved in
using the approximations used in Tables IV and V, respectively, instead
of an exact probability calculation.

Fig. 2 shows for selected values of $P^*$ that the sample sizes in Table I
and in the first portion of Table II can be "linearized" for large $n$ on a log-log
plot of $n$ versus $\beta^*$. Figs. 3 and 4 show the same result for the last
and middle portion of Table II, respectively.

Table I

Sample size required to attain a probability $P^*$ that a sample will be
simultaneously representative to within a common allowance $\beta^*$ of two
disjoint and exhaustive cells separated by the median for any true distribution.

In each set the first entry ($n$) is the smallest sample size required to satisfy
(4); the second entry ($n'$) is the smallest size required such that for all
sample sizes at least as large, (4) is satisfied; the last entry (drop) is the
maximum deviation in probability below $P^*$ obtained for all sample sizes
between the first two entries. Rows are values of $P^*$; columns are values of $\beta^*$.

P*    entry       0.01      0.05      0.10      0.15      0.20      0.25      0.40

0.50  n           1051        31         5         5         2         2         2
      n'          1199        59        14        10         5         2         2
      drop    (0.0264)  (0.1271)  (0.2266)  (0.1875)  (0.1250)       (0)       (0)

0.60  n           1700        60         5         5         3         3         3
      n'          1850        79        24        10         8         3         3
      drop    (0.0162)  (0.0704)  (0.3266)  (0.2875)  (0.2250)       (0)       (0)

0.70  n           2600       100        20         8         3         3         3
      n'          2750       119        29        16         8         6         3
      drop    (0.0124)  (0.0382)  (0.1049)  (0.2078)  (0.3250)  (0.0750)       (0)

0.75  n           3251       120        25        11         3         3         3
      n'          3399       150        39        16         9         6         3
      drop    (0.0077)  (0.0407)  (0.0769)  (0.1377)  (0.3750)  (0.1250)       (0)

0.80  n           4051       151        35        14         9         4         4
      n'          4199       179        44        24        12         7         4
      drop    (0.0058)  (0.0328)  (0.0430)  (0.0518)  (0.0266)  (0.0750)       (0)

0.85  n           5100       191        45        17        10         4         4
      n'          5250       219        54        27        15        10         4
      drop    (0.0052)  (0.0269)  (0.0434)  (0.0879)  (0.0766)  (0.1250)       (0)

0.90  n           6700       260        60        28        13         8         5
      n'          6850       279        74        33        18        11         5
      drop    (0.0029)  (0.0129)  (0.0299)  (0.0360)  (0.0796)  (0.0797)       (0)

0.95  n           9551       371        90        37        20        12         6
      n'          9699       399        99        47        28        15         6
      drop    (0.0012)  (0.0070)  (0.0114)  (0.0230)  (0.0284)  (0.0423)       (0)

0.99  n          16500       651       160        71        39        24         8
      n'         16650       679       169        76        42        26        12
      drop    (0.0003)  (0.0013)  (0.0022)  (0.0028)  (0.0015)  (0.0046)  (0.0017)

For $n \le 150$ the entries are all exact; for $n > 150$ the entries involve
approximations. The pattern of increases and decreases of the probability as a
function of $n$ was also used to obtain the first two entries for large $n$.



Table II

Minimum sample size required to attain a probability of at least $P^*$ that
a sample will be simultaneously representative to within a common
allowance $\beta^*$ of two disjoint and exhaustive cells separated by the
$100p$th percentile for any true distribution. (The degree of representativeness
is then defined as $d_g^* = \sqrt{(1 - \beta^*/p)\,(1 - \beta^*/(1 - p))}$.)

Rows are values of $P^*$; columns are values of $\beta^*$ within each group.

                                                        20th or 80th Percentile          10th or 90th Percentile
                    (p = 0.50)                            (p = 0.20 or 0.80)              (p = 0.10 or 0.90)
P*        0.01    0.05    0.10    0.15    0.20    0.01    0.05    0.10    0.15    0.20    0.01    0.05    0.10

0.50     1,051      31       5       5      2†     662      12       7       6      1†     355      14      1†
0.60     1,700      60       5       5      3†   1,062      32       7       6      1†     500      14      1†
0.70     2,600     100      20       8      3†   1,662      52      10       9      1†     900      20      1†
0.75     3,251     120      25      11      3†   2,062      72      10       9      1†   1,100      40      1†
0.80     4,051     151      35      14       9   2,562      92      20      12      1†   1,400      40      1†
0.85     5,100     191      45      17      10   3,262     120      27      12      3†   1,800      60      1†
0.90     6,700     251      60      28      13   4,262     160      37      15       5   2,355      80      1†
0.95     9,551     371      90      37      20   6,100     232      50      20      10   3,400     120      10
0.99    16,500     651     160      71      39  10,562     420     100      40      20   5,900     220      15

For $n \le 150$ the entries are all exact; for $n > 150$ the entries are based on
approximations together with a knowledge of the monotonicity pattern of the
probability of representativeness as a function of $n$.

† Small entries for certain pairs ($\beta^*$, $P^*$) indicate a condition too weak for
practical usage.



Table III

Minimum sample size required to attain a probability of at least $P^*$ that
a sample will be simultaneously representative to within a common
allowance $\beta^*$ of $k$ equi-probable disjoint and exhaustive cells for any
true distribution. (The degree of representativeness is then defined as
$d_g^* = 1 - k\beta^*$.)

Rows are values of $P^*$; columns are values of $\beta^*$ within each value of $k$.

            k = 2             k = 3             k = 4             k = 5          k = 10
P*      0.05  0.10  0.20  0.05  0.10  0.20  0.05  0.10  0.20  0.05  0.10  0.20  0.05  0.10

0.50      31     5     2   102    21     6   120    26     9   120    30     5   100    20
0.60      60     5     3   141    30     6   140    38     9   140    30     5   100    20
0.70     100    20     3   180    47    12   180    43    12   180    40     5   120    30
0.75     120    25     3   222    51    14   200    52    14   200    50    10   120    30
0.80     151    35     9   240    60    15   240    60    14   220    50    10   140    30
0.85     191    45    10   300    72    15   280    66    16   240    60    15   160    30
0.90     251    60    13   360    90    21   320    80    18   280    70    15   160    40
0.95     371    90    20   480   120    29   400   100    27   360    90    23   200    50
0.99     651   160    39   741   180    45   600   146    38   500   120    35   260    60

For $k \ge 3$ probabilities were computed exactly only for $n \le 200/k$; for
$n > 200/k$ the approximation in Appendix II was used together with a knowledge of
the monotonicity pattern of the probability of representativeness as a function
of $n$.

Table IV

Minimum sample size required to attain a probability of at least $P^*$
that a sample will be simultaneously representative to within a common
allowance $\beta^*$ of any two disjoint equi-probable cells defined by percentiles
and having a common probability $p$ under the true, unknown distribution.
(The degree of representativeness is then defined as $d_g^* =
1 - \beta^*/p$.)

Rows are values of $P^*$; columns are values of $\beta^*$ within each group.

Application:          Below 20th and Above     Below 10th and Above     Below 5th and Above
                      80th Percentiles         90th Percentiles         95th Percentiles
                      (p = 0.20)               (p = 0.10)               (p = 0.05)

P*        0.01    0.05    0.10    0.01    0.05    0.10    0.01    0.05

0.50     1,700      52      10     900      20      1†     450      1†
0.60     2,262      72      10   1,255      40      1†     600      1†
0.70     3,000     112      20   1,655      54      1†     850      1†
0.75     3,500     132      30   1,955      60      1†   1,000      1†
0.80     4,100     152      30   2,300      80      1†   1,150      1†
0.85     4,900     180      40   2,700     100      10   1,400      1†
0.90     6,000     232      50   3,355     120      20   1,750      1†
0.95     7,900     300      70   4,455     160      35   2,250      80
0.99    12,562     492     120   7,000     274      65   3,650     130

Another Application:  Between 30th and 50th    Between 40th and 50th    Between 45th and 50th
                      percentiles and between  percentiles and between  percentiles and between
                      50th and 70th            50th and 60th            50th and 55th
                      percentiles              percentiles              percentiles

For $n \le 40$ the entries are exact; for $n > 40$ normal approximation theory
was used.

† Small entries for certain pairs ($\beta^*$, $P^*$) indicate a condition too weak for
practical usage.

Table V

Minimum sample size required to attain a probability of at least $P^*$
that a sample from a population of size $N$ will be simultaneously representative
to within a common allowance $\beta^*$ of $k$ equi-probable disjoint
and exhaustive cells for any true population. (The degree of representativeness
is then defined as $d_g^* = 1 - k\beta^*$.)

The four rows in each set below correspond to $N = 60$, 120, 360, $\infty$,
respectively. Columns are values of $\beta^*$ within each value of $k$.

                k = 2             k = 3             k = 4             k = 5          k = 10
P*       N  0.05  0.10  0.20  0.05  0.10  0.20  0.05  0.10  0.20  0.05  0.10  0.20  0.05  0.10

0.50    60    20     5     2    40    19     6    40    20     7    40    20     3    34    10
       120    20     5     2    55    21     6    60    20     7    60    20     5    54    15
       360    20     5     2    81    21     6    80    20     7    80    24     5    74    15
         ∞    31     5     2   102    21     6   120    26     7   120    30     5   100    20

0.75    60    40    15     3    47    28    12    47    26    12    45    27     8    40    20
       120    60    20     3    76    37    14    74    38    12    72    30     8    60    25
       360    91    25     3   136    49    14   130    40    14   120    40    10    94    25
         ∞   120    25     3   222    51    15   200    52    14   200    50    10   120    30

0.85    60    51    25     9    53    30    14    50    32    14    49    30    10    40    20
       120    71    30    10    84    49    15    80    40    14    80    40    10    60    25
       360   120    40    10   162    60    15   150    58    16   152    50    13   100    30
         ∞   191    45    10   300    72    15   280    66    16   240    60    15   160    30

0.90    60    51    30    10    54    37    15    50    38    16    51    30    13    40    25
       120    80    40    13    93    51    19    90    46    16    80    40    13    74    25
       360   151    50    13   180    72    21   170    60    18   160    60    15   114    35
         ∞   251    60    13   360    90    21   320    80    20   280    70    15   160    40

0.95    60    51    35    16    54    42    21    50    38    18    52    37    15    47    25
       120    91    50    19    94    60    25    90    58    20    92    50    15    74    30
       360   180    70    20   201    88    27   190    80    25   180    70    18   120    40
         ∞   371    90    20   480   120    30   400   100    27   360    90    20   200    50

0.99    60    60    45    23    55    48    27    57    43    25    53    40    20    49    30
       120   100    70    30   102    72    30   100    66    29    98    60    23    80    40
       360   231   110    36   240   120    42   220   100    34   212    90    25   154    50
         ∞   651   160    39   741   180    45   600   146    37   500   120    30   260    60

For finite populations all entries with $n \le 2/\beta^*$ are based on exact computations;
the entries with $n > 2/\beta^*$ are based on the approximation in equation (A17) of
Appendix II. Another simpler approximation is given in equation (A18) of Appendix II.



Table VI

Comparison between the exact value of and the normal approximation
to the joint probability that in a sample of size $n$ from an infinite population
the number of observations falling in each of two tails with common
probability $p$ is between $n(p - \beta^*)$ and $n(p + \beta^*)$, inclusive.

                            p = 0.10      p = 0.20      p = 0.20
                            β* = 0.05     β* = 0.05     β* = 0.10

n = 10   Normal Approx.      0.1628        0.0973        0.5910
         Exact               0.1510        0.0941        0.6014
         Error              +0.0118       +0.0032       -0.0104

n = 20   Normal Approx.      0.5432        0.3654        0.7075
         Exact               0.5566        0.3648        0.7171
         Error              -0.0134       +0.0006       -0.0096

n = 40   Normal Approx.      0.6608        0.4655        0.8574
         Exact               0.6731        0.4669        0.8736
         Error              -0.0123       -0.0014       -0.0162



Table VII

Comparison between the exact value of and the normal approximation
to the joint probability that in a sample of size $n$ from a population of
size $N$ the number of observations falling in each of $k$ equi-probable cells
is between $n\left(\tfrac{1}{k} - \tfrac{1}{20}\right)$ and $n\left(\tfrac{1}{k} + \tfrac{1}{20}\right)$, inclusive.

N = ∞ (Infinite Population)

                             k = 2      k = 3      k = 4      k = 5     k = 10

n = 20   Normal Approx.     0.4977     0.1166     0.1600     0.1172     0.0698
         Exact              0.4966     0.1145     0.1618     0.0955     0.0669
         Error             +0.0011    +0.0021    -0.0018    +0.0217    +0.0029

n = 40   Normal Approx.     0.5708     0.2196     0.2388     0.1962     0.1775
         Exact              0.5704     0.2181     0.2363     0.1904     0.1478
         Error             +0.0004    +0.0015    +0.0025    +0.0058    +0.0297

n = 60   Normal Approx.     0.6338     0.3974     0.3230     0.2876     0.3325
         Exact              0.6338     0.3982     0.3174     0.2979          *
         Error              0.0000    -0.0008    +0.0056    -0.0103          *

N = 120 (Finite Population)

                             k = 2      k = 3      k = 4      k = 5     k = 10

n = 20   Normal Approx.     0.5357     0.1397     0.1984     0.1550     0.1092
         Exact              0.5368     0.1359     0.1801     0.1547     0.1011
         Error             -0.0011    +0.0038    +0.0183    +0.0003    +0.0081

n = 40   Normal Approx.     0.6651     0.2822     0.3705     0.3413     0.4291
         Exact              0.6670     0.3084     0.3679     0.3313     0.3357
         Error             -0.0019    -0.0262    +0.0026    +0.0100    +0.0934

n = 60   Normal Approx.     0.7969     0.6338     0.6115     0.6228     0.8507
         Exact              0.7989     0.6104     0.6003     0.5972          *
         Error             -0.0020    +0.0234    +0.0112    +0.0256          *

REPRESENTATIVENESS OF A SAMPLE 



Fig. 1 - Pictorial diagram of representativeness using deciles ($k = 10$).
(Horizontal axis marked at $F^{-1}(0)$, $F^{-1}(0.2)$, $F^{-1}(0.4)$, $F^{-1}(0.6)$,
$F^{-1}(0.8)$ and $F^{-1}(1.0)$.)

Fig. 2 - Minimum sample size $n$ required to attain a probability of at least $P^*$
that a sample is simultaneously representative to within a common allowance $\beta^*$
of two disjoint and exhaustive cells each having probability $p = \tfrac{1}{2}$ under the true
unknown distribution. (The degree of representativeness is $d_g^* = 1 - 2\beta^*$.)

Fig. 3 - Minimum sample size $n$ required to attain a probability of at least
$P^*$ that a sample is simultaneously representative to within a common allowance
$\beta^*$ of the two disjoint, exhaustive cells separated by the 10th (or the 90th)
percentile for any true distribution. [The degree of representativeness is
$d_g^* = \tfrac{10}{3}\sqrt{(0.1 - \beta^*)(0.9 - \beta^*)}$.]

Fig. 4 - Minimum sample size $n$ required to attain a probability of at least $P^*$
that a sample is simultaneously representative to within a common allowance $\beta^*$
of the two disjoint, exhaustive cells separated by the 20th (or the 80th)
percentile for any true distribution. [The degree of representativeness is
$d_g^* = \tfrac{5}{2}\sqrt{(0.2 - \beta^*)(0.8 - \beta^*)}$.]

V. EMPIRICALLY OBSERVED MONOTONICITIES

It is interesting to note in Table III that for fixed $\beta^*$ and increasing
$k$ the sample size $n$ required is not monotonic but appears to reach a
maximum and then decrease. As a result of this it becomes possible
to speak of the sample size $n$ required for a sample to be representative
for any specified $\beta^*$ regardless of the number $k$ of pairwise disjoint,
exhaustive, equi-probable cells considered, provided only that $k \le 1/\beta^*$.
For example, for $\beta^* = 0.1$ it appears likely from Table III that 90 observations
would be sufficient to have a confidence of at least $P^* = 0.90$
that the sample is representative in the sense of (2) for any one value of
$k$ ($k = 1, 2, \ldots, 10$).

Table VIII, some of whose entries are taken from Table III, shows
numerically that for fixed $d_g^*$ the required sample size is a monotonically
non-decreasing function not only of $P^*$ but also of $k$; for fixed $\beta^*$, Table III
shows numerically that only the monotonicity with $P^*$ holds. The former
result is again shown in Figs. 5 and 6 which also emphasize the possibilities
of interpolation on $k$.

The above monotonicities and lack of monotonicities have not been 
demonstrated mathematically. 
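
The two observations can be checked mechanically against the tabulated values. The sketch below copies the $P^* = 0.90$ entries for $\beta^* = 0.10$ from Table III and for $d_g^* = 0.80$ from Table VIII; the variable and function names are illustrative assumptions, and the check is purely empirical, like the remarks above.

```python
# Required sample sizes keyed by k, copied from the tables for P* = 0.90.
table3_beta_010 = {2: 60, 3: 90, 4: 80, 5: 70, 10: 40}   # fixed beta* = 0.10
table8_degree_080 = {2: 60, 4: 320, 10: 1100}            # fixed d_g* = 0.80


def is_non_decreasing(sizes_by_k):
    """True when the required sample size never drops as k increases."""
    sizes = [sizes_by_k[k] for k in sorted(sizes_by_k)]
    return all(a <= b for a, b in zip(sizes, sizes[1:]))


# For fixed beta* the largest tabulated requirement covers every tabulated k.
print(max(table3_beta_010.values()))
print(is_non_decreasing(table3_beta_010))
print(is_non_decreasing(table8_degree_080))
```

Only the tabulated values of $k$ are covered; nothing is implied for values of $k$ that the tables omit.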



Table VIII

Minimum sample size required to attain a probability of at least $P^*$ that
a sample will be simultaneously representative to a degree $d_g^* = 1 - k\beta^*$
of $k$ equi-probable disjoint and exhaustive cells for any true distribution.

              d_g* = 0.80                 d_g* = 0.90
P*       k = 2   k = 4  k = 10      k = 2   k = 5  k = 10

0.50         5     120     600         31     800    2500
0.60         5     140     700         60     950    2800
0.70        20     180     800        100    1150    3200
0.75        25     200     850        120    1250    3400
0.80        35     240     900        151    1400    3700
0.85        45     280    1000        191    1600    4000
0.90        60     320    1100        251    1850    4400
0.95        90     400    1250        371    2250    5100
0.99       160     600    1650        651    3150    6600

In comparing results for a fixed degree $d_g^*$ it should be noted that the sample
size appears to be a monotonically non-decreasing function of $P^*$ and also of $k$;
for a fixed common allowance $\beta^*$ only the monotonicity with $P^*$ holds as is evident
in Table II. The remarks at the bottom of Table III apply here also.

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VI. CONFIDENCE BANDS — INFINITE POPULATION CASE 

The experimenter will usually be interested in the confidence state- 
ment that the above formulation allows him to make after the observa- 
tions are taken. Suppose, for example, that he was interested in representa- 
tiveness in each of k = 10 pairwise disjoint, exhaustive and equi-probable 
cells and that he specified 0* = 0.02 (so that d* = 0.80) and P* = 0.85 
and that he has taken 1,000 observations in accordance with Table VIII. 
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Fig. 5 — Minimum sample size n required 
to attain a probability of at least P* that a 
sample will he simultaneously representa- 
tive to a degree d g * = 0.00 of A- equi -proba- 
ble, disjoint and exhaustive cells for any 
true distribution. The common allowance 
0* is given by 0* = (1 - d„*)/k = 0.10/ft. 
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Fig. 6 — Minimum sample size n required 
to attain a probability of at least P* that a 
sample will be simultaneously representa- 
tive to a degree d„* = 0.80 of A- equi-proba- 
ble, disjoint and exhaustive cells for any 
true distribution. The common allowance 
0* is given by/3* = (1 - d„*)/k = 0.20/A. 
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He can then make a number of confidence statements about the popula- 
tion deciles F~\0.1), F~\0.2) } ■•• , F~\0.9) (and also about F~\0) 
and F _1 (l) denned as the greatest lower bound of all x for which F(x) > 
and the least upper bound of all x for which F(x) < 1, respectively). 
For example, if :r m denotes the mth (smallest) ordered observation, it 
follows from the condition of representativeness that we have simul- 
taneously with joint confidence greater than P* all of the inequalities 



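\[
\begin{array}{ll}
-\infty \le F^{-1}(0) < x_{1}       & \qquad x_{1000} < F^{-1}(1) \le \infty \\
x_{80} \le F^{-1}(0.1) < x_{121}    & \qquad x_{879} < F^{-1}(0.9) \le x_{920} \\
x_{160} \le F^{-1}(0.2) < x_{241}   & \qquad x_{759} < F^{-1}(0.8) \le x_{840} \\
\qquad \vdots                       & \qquad \qquad \vdots \\
x_{640} \le F^{-1}(0.8) < x_{961}   & \qquad x_{39} < F^{-1}(0.2) \le x_{360} \\
x_{720} \le F^{-1}(0.9) < \infty    & \qquad -\infty < F^{-1}(0.1) \le x_{280} \\
x_{1000} \le F^{-1}(1) \le \infty   & \qquad -\infty \le F^{-1}(0) \le x_{1}
\end{array}
\tag{6}
\]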



For example, F^{-1}(0.2) must be greater than or equal to x_160 and less 
than x_241 in the confidence statement since under the condition of 
representativeness all cells and, in particular, the two leftmost cells 
each contain between 80 and 120 observations, inclusive. 
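
The index arithmetic behind (6) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `decile_index_bounds` is hypothetical, the cells are taken to be equi-probable and exhaustive, and the off-by-one convention for the right-hand set simply mirrors the subscripts that are legible in (6) (for example x_879 and x_840).

```python
from fractions import Fraction
from math import ceil, floor, inf

def decile_index_bounds(n, k, beta):
    """Order-statistic indices bounding F^{-1}(j/k), j = 1..k-1, as in (6).

    Returns two dicts keyed by j. left[j] = (a, b) means
    x_a <= F^{-1}(j/k) < x_b; right[j] = (a, b) means
    x_a < F^{-1}(j/k) <= x_b. An infinite entry means no finite bound.
    """
    lo = ceil(n * (Fraction(1, k) - beta))   # fewest observations per cell
    hi = floor(n * (Fraction(1, k) + beta))  # most observations per cell
    left, right = {}, {}
    for j in range(1, k):
        # Working from the left end: j cells lie below the j-th quantile.
        upper = j * hi + 1
        left[j] = (j * lo, upper if upper <= n else inf)
        # Working from the right end: k - j cells lie above it.
        lower = n - (k - j) * hi - 1
        right[j] = (lower if lower >= 1 else -inf, n - (k - j) * lo)
    return left, right

# The example of this section: n = 1000, k = 10, beta* = 0.02.
left, right = decile_index_bounds(1000, 10, Fraction(2, 100))
for j in range(1, 10):
    print(j / 10, left[j], right[j])
```

Passing β* as a `Fraction` avoids floating-point trouble when n(1/k ± β*) is an exact integer, as it is here (80 and 120).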

The right hand set of inequalities is in reverse order since it is 
obtained by reasoning similar to that for the left hand set except that 
we start at the right end of the distribution and work backwards. If we 
keep only the stronger results in (6) for each decile and disregard the 
weaker ones, then we obtain eleven (finite or infinite) line segments as 
in Fig. 7. We can then state with joint confidence greater than P* that 
the unknown distribution F has a (finite or infinite) point of contact 
with (or a saltus passing through) each of the line segments; the two end 
segments are actually half-lines and in these cases we must allow +∞ and 
−∞ as possible "points" of contact. 

Fig. 7 - Confidence intervals for the deciles with joint confidence level 
P* = 0.85 for k = 10, n = 1000 and β* = 0.02 (which implies that 
d_g* = 0.80). 

The above result then gives rise to two "staircases", as in the middle 
diagram of Fig. 8, such that any distribution contacting every line seg- 
ment in Fig. 7 must everywhere lie between (or on the boundary of) the 
two "staircases". Hence we can state with confidence greater than P* 
(see explanation below) that the two "staircases" form a confidence 
band on the unknown distribution. 

If we keep k and P* fixed and decrease β* (or increase d_g* = 1 − kβ*) 
then the required sample size increases and the confidence band becomes 
narrower. This is illustrated in the three diagrams of Fig. 8. 

[Figure 8: three panels, each with vertical scale from 0 to 1.00 and 
order statistics marked on the horizontal axis; one panel is labeled 
d_g* = 0.9, β* = 0.01, n = 4000.] 

Fig. 8 - Confidence bands which include the true distribution function 
with confidence greater than P* = 0.85 for k = 10 and d_g* = 0.5, 0.8, 
0.9. Small circles between the confidence bands represent ordinates of 
the sample distribution function. The three figures above were 
constructed with observations obtained from a table of random normal 
deviates (with different horizontal scaling applied in each case). 

It should be noted that the inequalities (6) are implied by but do not 
imply (i.e., they are not equivalent to) the condition of representative- 
ness. Hence the confidence level associated with (6) is greater than the 
specified P*. To illustrate this we note from (6) the stronger inequalities 

\[ x_{80} \le F^{-1}(0.1) < x_{121} \quad\text{and}\quad x_{160} \le F^{-1}(0.2) < x_{241}. \tag{7} \]

These inequalities (7) allow as few as 40 and as many as 161 observations 
between F^{-1}(0.1) and F^{-1}(0.2), including endpoints. On the other hand 
we have confidence P*, under the condition of representativeness, that 
every such cell contains between 80 and 120 observations, inclusive. This 
shows that the confidence level associated with the confidence band is 
greater than the probability achieved for the representativeness of the 
sample. 

This method of obtaining a confidence band for the unknown dis- 
tribution would be more valuable if we could obtain a simple way of 
computing (or estimating more accurately) the actual confidence level 
attained. For example, with k = 3, β* = 0.10 (so that d_g* = 0.70) 
and P* = 0.60 we obtain n = 30 from Table III, the probability achieved 
for representativeness is 0.6369 and the confidence level associated with 
the two "staircases" is 0.6825. The latter is obtained by using inequali- 
ties similar to (6) and computing the probability exactly with a multi- 
nomial distribution. The reader should note that the idea of a confidence 
band containing the true, unknown distribution is not the main theme 
of this paper but only an interesting by-product of the idea of the repre- 
sentativeness of the sample. 

APPENDIX I 

Exact Formulae - Finite and Infinite Populations 

The concept of the representativeness of a sample can be applied to 
finite as well as infinite populations. Let N denote the total size of a 
finite population; conceptually we may regard the population as being 
partitioned into k subsets S_i of (relative) size F(S_i) (i = 1, 2, ..., k). 
We shall assume that the sets S_i are pairwise disjoint and, to simplify 
the discussion, we also assume that the quantities M_i = N F(S_i) 
(i = 1, 2, ..., k) are positive integers. 

Let x_i ≥ 0 denote the random integral number of observations in the 
observed sample of size n which fall in the set S_i (i = 1, 2, ..., k). If 
the k sets S_i are exhaustive then 

\[ \sum_{i=1}^{k} x_i = n \quad\text{and}\quad \sum_{i=1}^{k} M_i = N. \tag{A1} \]

We define for i = 1, 2, ..., k 

\[ c_i = n[F(S_i) - \beta_i^*] \quad\text{and}\quad d_i = n[F(S_i) + \beta_i^*], \tag{A2} \]

which are non-negative but need not be integers. Then for a finite 
population the probability corresponding to the left member of (5), using 
the hypergeometric distribution, is given exactly by 

\[ P\left\{ \left| \frac{x_i}{n} - F(S_i) \right| \le \beta_i^* \ (i = 1, 2, \ldots, k) \right\}
   = \sum \prod_{i=1}^{k} \binom{M_i}{x_i} \Big/ \binom{N}{n} \tag{A3} \]

where \binom{M}{x} is the usual binomial coefficient and the summation in 
(A3) is over all vectors x = [x_1, x_2, ..., x_k] satisfying (A1) for which 

\[ c_i \le x_i \le d_i \quad (i = 1, 2, \ldots, k). \tag{A4} \]

If the k sets are not exhaustive then we define another set S_{k+1} which 
is the complement of the union of the k sets S_i and use (A3) with k 
replaced by k + 1 in (A1) and (A3) but not in (A4), i.e., no condition is 
applied to the (k + 1)th variable. 

In the case of an infinite population we use the multinomial 
distribution. If the k sets S_i are exhaustive, then using (A2) and 
letting p_i = F(S_i) (i = 1, 2, ..., k) the left hand member of (5) is 
given exactly by 

\[ P\left\{ \left| \frac{x_i}{n} - p_i \right| \le \beta_i^* \ (i = 1, 2, \ldots, k) \right\}
   = n! \sum \prod_{i=1}^{k} \frac{p_i^{x_i}}{x_i!} \tag{A5} \]

where the summation is again over all vectors x = [x_1, x_2, ..., x_k] 
satisfying (A1) and (A4). If the k sets are not exhaustive then we define 
S_{k+1} as above and the same expression (A5) is obtained with k replaced 
by k + 1 in (A1) and (A5) but not in (A4), i.e., no condition is applied 
to the (k + 1)th variable. 

It is interesting to note that the results for the infinite case (N = ∞) 
can be obtained from those of the finite case by letting N tend to 
infinity. Table V illustrates this numerically since the four entries in 
each set correspond to N = 60, 120, 360 and ∞, respectively. 
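
For small k and n the sums (A3) and (A5) can be evaluated by direct enumeration. The following Python sketch is illustrative only: the function names are hypothetical, the cells are assumed exhaustive, and exact rational arithmetic is used so that the bounds (A2) are not disturbed by rounding. It is applied to the case k = 3, β* = 0.10, n = 30 discussed in Section VI; tabulated values in the paper may rest on the approximations of Appendix II, so agreement with them need not be exact in every digit.

```python
from fractions import Fraction
from itertools import product
from math import ceil, comb, factorial, floor

def cell_ranges(n, probs, betas):
    """Integer ranges c_i <= x_i <= d_i implied by (A2) and (A4)."""
    return [range(max(0, ceil(n * (p - b))), min(n, floor(n * (p + b))) + 1)
            for p, b in zip(probs, betas)]

def prob_infinite(n, probs, betas):
    """Multinomial sum (A5) for exhaustive cells with probabilities probs."""
    ranges = cell_ranges(n, probs, betas)
    total = Fraction(0)
    # Enumerate the first k-1 counts; the last one is fixed by (A1).
    for head in product(*ranges[:-1]):
        last = n - sum(head)
        if last not in ranges[-1]:
            continue
        term = Fraction(factorial(n))
        for x, p in zip(head + (last,), probs):
            term *= p ** x / factorial(x)
        total += term
    return float(total)

def prob_finite(n, sizes, betas):
    """Hypergeometric sum (A3) for exhaustive subsets of sizes M_i."""
    N = sum(sizes)
    probs = [Fraction(m, N) for m in sizes]
    ranges = cell_ranges(n, probs, betas)
    total = 0
    for head in product(*ranges[:-1]):
        last = n - sum(head)
        if last not in ranges[-1]:
            continue
        term = 1
        for x, m in zip(head + (last,), sizes):
            term *= comb(m, x)
        total += term
    return float(Fraction(total, comb(N, n)))

betas = [Fraction(1, 10)] * 3
p_inf = prob_infinite(30, [Fraction(1, 3)] * 3, betas)
p_120 = prob_finite(30, [40, 40, 40], betas)
print(p_inf, p_120)
```

Increasing the subset sizes in `prob_finite` (keeping them equal) shows the finite-population value approaching the multinomial value, in line with the remark above about letting N tend to infinity.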

APPENDIX II 

Approximate Solutions — Infinite and Finite Populations 

Let x_i denote the random integral number of observations in a sample 
of size n which fall in the ith cell (i = 1, 2, ..., k). If we let 
y_i = x_i − (n/k), then the two conditions \sum_{i=1}^{k} x_i = n and 

\[ \sum_{i=1}^{k} y_i = 0 \tag{A6} \]

are equivalent. Let [x] denote the largest integer not greater than x. 
We shall consider only the case of the equi-probable exhaustive sets. 
In the case of an infinite population we wish to compute 

\[ P = P\left\{ n\left(\frac{1}{k} - \beta_i^*\right) \le x_i \le n\left(\frac{1}{k} + \beta_i^*\right)
   \ (i = 1, 2, \ldots, k) \ \Big|\ \sum_{i=1}^{k} x_i = n \right\}. \tag{A7} \]

If we introduce a continuity correction and use (A6) then we obtain 

\[ P = P\left\{ -b_i \le y_i \le a_i \ (i = 1, 2, \ldots, k) \ \Big|\ \sum_{i=1}^{k} y_i = 0 \right\} \tag{A8} \]

where for each i (i = 1, 2, ..., k) 

\[ a_i = \frac{1}{2} + \left[ n\left(\beta_i^* + \frac{1}{k}\right) \right] - \frac{n}{k}
   \quad\text{and}\quad
   b_i = \frac{1}{2} + \left[ n\left(\beta_i^* - \frac{1}{k}\right) \right] + \frac{n}{k}. \tag{A9} \]

If n/k is an integer and β* is the common value of β_i* (i = 1, 2, ..., k) 
then a_1 = a_2 = ... = a_k = b_1 = b_2 = ... = b_k = a (say) and (A8) 
reduces to 

\[ P = P\left\{ |y_i| \le a \ (i = 1, 2, \ldots, k) \ \Big|\ \sum_{i=1}^{k} y_i = 0 \right\} \tag{A10} \]

where a = 1/2 + [nβ*]. 

To compute (A10) two approximations are made. The k-variate multinomial 
probability is first transformed by an orthogonal transformation into a 
(k − 1)-variate distribution with homoscedastic and uncorrelated 
variables and the first approximation is to replace the latter 
distribution by a multivariate normal distribution with independent 
variables. The region of integration is the intersection of the hypercube 
|y_i| ≤ a centered at the origin with edge-length 2a and the hyperplane 
(A6); the orthogonal transformation merely rotates this intersection 
about the origin. These intersections are convex figures symmetric with 
respect to the origin; for example, it is a regular centered hexagon for 
k = 3. These intersections, called Stott figures, are discussed in 
Appendix III. The second approximation made in computing (A10) was to 
replace the Stott figure by a (k − 1)-dimensional central sphere whose 
radius R is determined by equating the two hypervolumes. Values of R for 
k = 2(1)12 for any a are given in Table IX. 

Table IX 
Intersection ϑ of the hypercube of edge-length 2a centered at the origin 
and the hyperplane x_1 + x_2 + ... + x_k = 0. 

  Dimension k     J(k) = Number of equally     Radius R of sphere with 
  of hypercube    large simplices in ϑ         content equal to that of ϑ 

       2                          1                   1.4142 a 
       3                          6                   1.2861 a 
       4                          4                   1.3655 a 
       5                        230                   1.4436 a 
       6                         66                   1.5225 a 
       7                     23,548                   1.5995 a 
       8                      2,416                   1.6733 a 
       9                  4,675,014                   1.7443 a 
      10                    156,190                   1.8126 a 
      11              1,527,092,468                   1.8786 a 
      12                 15,724,248                   1.9422 a 

The content I(k) of ϑ for all k is given by 

\[ I(k) = \frac{a^{k-1}\sqrt{k}}{(k-1)!}
   \left[ \binom{k}{0} k^{k-1} - \binom{k}{1} (k-2)^{k-1} + \binom{k}{2} (k-4)^{k-1} - \cdots \right] \]

where the terms continue only as long as the arguments k, k − 2, ... are 
positive. The radius R of a (k − 1)-dimensional sphere of equal content is 
obtained by equating I(k) and (R\sqrt{\pi})^{k-1} / \Gamma((k+1)/2). 
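
As a check on the note to Table IX, the radius column can be recomputed numerically. The Python sketch below is not part of the original paper; the function names are hypothetical, and the prefactor a^{k-1} sqrt(k) in the content formula is an assumption inferred from the geometry and from the R column itself (it reproduces, for instance, the hexagon area for k = 3). The sphere content is taken as pi^{(k-1)/2} R^{k-1} / Gamma((k+1)/2).

```python
from math import comb, factorial, gamma, pi, sqrt

def stott_content(k, a=1.0):
    """Content I(k) of the Stott figure via the alternating sum of Table IX."""
    total = 0
    j = 0
    while k - 2 * j > 0:                 # keep terms with positive argument only
        total += (-1) ** j * comb(k, j) * (k - 2 * j) ** (k - 1)
        j += 1
    # Prefactor a^(k-1) * sqrt(k) is an assumption, see the lead-in.
    return a ** (k - 1) * sqrt(k) * total / factorial(k - 1)

def equivalent_radius(k, a=1.0):
    """Radius R of the (k-1)-dimensional sphere with content I(k)."""
    unit_ball = pi ** ((k - 1) / 2) / gamma((k + 1) / 2)
    return (stott_content(k, a) / unit_ball) ** (1 / (k - 1))

for k in range(2, 13):
    print(k, round(equivalent_radius(k), 4))
```

The printed values can be compared with the last column of Table IX; small differences in the fourth decimal place are to be expected from rounding in the printed table.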



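The orthogonal transformation referred to above is 

\[ y_i' = \frac{1}{\sqrt{i(i+1)}} \left[ y_1 + y_2 + \cdots + y_i - i\,y_{i+1} \right]
   \quad (i = 1, 2, \ldots, k) \tag{A11} \]

where y_{k+1} is defined to be identically zero. Then y_k' is identically 
zero by (A6). Each y_i has variance n(k − 1)/k^2 and each pair y_i, y_j 
(i ≠ j) has covariance −n/k^2. The remaining y_i' all have a common 
variance n/k since for each i (i = 1, 2, ..., k − 1) 

\[ \sigma^2_{y_i'} = \frac{1}{i(i+1)} \left\{ i(i+1)\,\frac{n(k-1)}{k^2} + i(i+1)\,\frac{n}{k^2} \right\}
   = \frac{i(i+1)\,n}{i(i+1)\,k} = \frac{n}{k} \tag{A12} \]

and are pairwise uncorrelated since for i < j 

\[ \sigma_{y_i' y_j'} = \frac{1}{\sqrt{i(i+1)j(j+1)}} \left\{ \frac{n i (k-1)}{k^2}
   + i(j-1)\left(-\frac{n}{k^2}\right) - ij\left(-\frac{n}{k^2}\right)
   - i\left[ \frac{n(k-1)}{k^2} + (j-1)\left(-\frac{n}{k^2}\right) \right]
   + ij\left(-\frac{n}{k^2}\right) \right\} = 0. \tag{A13} \]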

If we let ν = k − 1, let r = R/σ = R\sqrt{k/n} and let S denote the 
central sphere of radius r then the approximate probability (dropping 
primes) is given by 

\[ P = \int \cdots \int_{S} \left( \frac{1}{2\pi} \right)^{\nu/2}
   \exp\left\{ -\frac{1}{2} \sum_{i=1}^{\nu} y_i^2 \right\} dy_1 \cdots dy_\nu
   = P\{ \chi_\nu^2 \le r^2 \} \tag{A14} \]

where χ_ν^2 denotes a chi-square random variable with ν degrees of 
freedom. In the case of a finite population of size N the only change in 
the above discussion is to replace (A12) by 

\[ \sigma^2_{y_i'} = \frac{n}{k} \left( \frac{N-n}{N-1} \right) \quad (i = 1, 2, \ldots, k-1) \tag{A15} \]

thus increasing the value of r and the value of P; this decreases n if P 
is held fixed at any P*. If we let n_N and n_∞ denote the required values 
for a finite population of size N and an infinite population, 
respectively, for the same fixed k, β* and P* then we obtain from (A14) 
and (A15) 



tt„o = riff 



N ~ n ^ (Alfl) 



N - 1. 

or, taking the smaller solution in n_N, we have for large N 



n-tt = 



N - y/W - A{N - \)n„ (A17) 



Replacing N − 1 by N in (A16) we easily obtain for large N the simpler 
result 

1.^1-1 (A18) 

The error in P involved in both of the above approximations (A14) 
and (A17) is evaluated in Table VII for N = 120 and N = ∞ for selected 
values of n, β* and k. 
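
The chain of approximations leading to (A14), with the finite-population variance (A15) as an option, can be sketched as follows. This is an illustrative Python sketch, not the authors' computing procedure: the function names are hypothetical, n/k is assumed to be an integer with a common β*, the radius R is recomputed from the content formula of Table IX (with the prefactor sqrt(k) assumed there), and the chi-square distribution function is evaluated with a plain power series for the regularized incomplete gamma function.

```python
from fractions import Fraction
from math import comb, exp, factorial, floor, gamma, lgamma, log, pi

def radius_over_a(k):
    """R/a obtained by equating the Stott-figure content with a (k-1)-sphere."""
    total = sum((-1) ** j * comb(k, j) * (k - 2 * j) ** (k - 1)
                for j in range((k + 1) // 2))
    content = k ** 0.5 * total / factorial(k - 1)
    unit_ball = pi ** ((k - 1) / 2) / gamma((k + 1) / 2)
    return (content / unit_ball) ** (1 / (k - 1))

def chi2_cdf(x, nu):
    """P{chi-square with nu d.f. <= x}, by the series for the lower gamma."""
    if x <= 0:
        return 0.0
    s, z = nu / 2.0, x / 2.0
    term = total = 1.0 / s
    n = 0
    while term > 1e-17 * total and n < 10000:
        n += 1
        term *= z / (s + n)
        total += term
    return exp(s * log(z) - z - lgamma(s)) * total

def approx_probability(n, k, beta, N=None):
    """Approximation (A14); pass N to use the finite-population variance (A15)."""
    a = 0.5 + floor(n * beta)            # a = 1/2 + [n beta*], as in (A10)
    sigma2 = n / k                       # common variance (A12)
    if N is not None:
        sigma2 *= (N - n) / (N - 1)      # finite-population factor (A15)
    r2 = (radius_over_a(k) * a) ** 2 / sigma2
    return chi2_cdf(r2, k - 1)

p_inf = approx_probability(30, 3, Fraction(1, 10))
p_120 = approx_probability(30, 3, Fraction(1, 10), N=120)
print(round(p_inf, 4), round(p_120, 4))
```

For k = 3, β* = 0.10 and n = 30 the infinite-population value should come out close to the 0.6369 quoted in Section VI, and the finite-population value is larger, as the text following (A15) states.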

If n/k is not an integer then the above discussion may not apply since 
a_i may not equal b_i in (A9). Assuming again a common β* then we have 
a common "a" and a common "b" in (A9). In this case, averaging the 
approximate probabilities obtained by using 2a and 2b alternately as 
the edge-length of the hypercube was found to be satisfactory for 
computing the tables of this paper. 

APPENDIX III 

Geometric Results and Eulerian (Diamond) Numbers 

The problem here is to find the (k − 1)-dimensional content (or 
hypervolume) of the intersection ϑ of the centered k-dimensional 
hypercube |y_i| ≤ a (i = 1, 2, ..., k) and the (k − 1)-dimensional 
hyperplane y_1 + y_2 + ... + y_k = 0. The geometry for even k and odd k is 
quite different. The number of vertices of ϑ for even k and odd k, 
respectively, is 

\[ \binom{k}{k/2} \quad\text{and}\quad k \binom{k-1}{(k-1)/2}; \tag{A19} \]

for example, for k = 3 we obtain the 3\binom{2}{1} = 6 vertices 
(a, −a, 0), (−a, a, 0), (a, 0, −a), (−a, 0, a), (0, a, −a) and (0, −a, a). 
The vertices are all equally distant from the origin. All the edges of ϑ 
have a common length d = d(k) which equals 2a\sqrt{2} for even k and 
a\sqrt{2} for odd k. The intersection ϑ is a convex figure which is 
symmetric with respect to the origin and is known as a Stott figure. The 
Stott figure can be partitioned into an integral number J(k) of 
(k − 1)-dimensional simplices which are not necessarily regular but are 
such that each simplex has the same content as a regular 
(k − 1)-dimensional simplex with edge-length d. Hence, using a result on 
page 125 of Reference 8, the content I(k) of ϑ is given by 

\[ I(k) = J(k)\, \frac{d^{k-1}}{(k-1)!} \sqrt{\frac{k}{2^{k-1}}}. \tag{A20} \]

The integers J(k) are given in the middle column of Table IX; for 
example, the integer 6 for k = 3 indicates that there are six equilateral 
triangles in the centered hexagon. 
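
The vertex counts (A19), the edge lengths and the content (A20) can be checked by brute force for small k. The following Python sketch is illustrative: the function names are hypothetical, the edge length is taken to be the smallest distance between two distinct vertices (an assumption about which vertex pairs are joined by edges), and (A20) is used in the form J(k) times the content d^{k-1} sqrt(k / 2^{k-1}) / (k-1)! of a regular (k-1)-simplex with edge d.

```python
from itertools import product
from math import comb, dist, factorial, sqrt

def stott_vertices(k, a=1.0):
    """Vertices of the Stott figure.

    Even k: k/2 coordinates equal +a and k/2 equal -a.
    Odd k: one coordinate is 0, the rest split evenly between +a and -a.
    """
    zeros = k % 2
    return [tuple(a * s for s in signs)
            for signs in product((-1, 0, 1), repeat=k)
            if sum(signs) == 0 and signs.count(0) == zeros]

def edge_length(k, a=1.0):
    """Smallest vertex-to-vertex distance, taken here as the edge length d(k)."""
    v = stott_vertices(k, a)
    return min(dist(p, q) for i, p in enumerate(v) for q in v[i + 1:])

def content_from_simplices(k, J, a=1.0):
    """Content via (A20): J regular (k-1)-simplices of edge d(k)."""
    d = edge_length(k, a)
    return J * d ** (k - 1) / factorial(k - 1) * sqrt(k / 2 ** (k - 1))

for k, J in ((3, 6), (4, 4), (5, 230)):   # J(k) taken from Table IX
    print(k, len(stott_vertices(k)), edge_length(k), content_from_simplices(k, J))
```

For k = 3 this gives six vertices, edge length a sqrt(2) and the area 3 sqrt(3) a^2 of the regular hexagon.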

D. Slepian has shown that for even k the integers J(k) can be found 
by generating a "triangle" of numbers using the recurrence relation 

\[ S_{i,j} = j\,S_{i-1,j} + i\,S_{i,j-1} \quad (i, j = 1, 2, \ldots) \tag{A21} \]

with boundary conditions S_{1,j} = S_{j,1} = 1 for all j; then the desired 
quantities are 

\[ S_{i,i} = J(2i) \quad (i = 1, 2, \ldots). \tag{A22} \]

Similarly for odd k he showed that we can use the recurrence relation 

\[ T_{i,j} = (2j+1)\,T_{i-1,j} + (2i+1)\,T_{i,j-1} \quad (i, j = 1, 2, \ldots) \tag{A23} \]

with boundary conditions T_{0,j} = T_{j,0} = 1 for all j; then the desired 
quantities are 

\[ T_{i,i} = J(2i+1) \quad (i = 1, 2, \ldots). \tag{A24} \]

Fig. 9 shows these numbers in two diamond-shaped patterns and ex- 
plains another interesting way of obtaining these numbers. 
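
The two recurrences are easy to run. The Python sketch below is an illustration only (the function names `S`, `T` and `J` follow the notation of this appendix but the code is not from the paper); it assumes the boundary values S_{1,j} = S_{j,1} = 1 and T_{0,j} = T_{j,0} = 1 and applies the recurrences for the remaining indices.

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def S(i, j):
    """Even case: S_{i,j} = j S_{i-1,j} + i S_{i,j-1}, with S_{1,j} = S_{j,1} = 1."""
    if i == 1 or j == 1:
        return 1
    return j * S(i - 1, j) + i * S(i, j - 1)

@lru_cache(maxsize=None)
def T(i, j):
    """Odd case: T_{i,j} = (2j+1) T_{i-1,j} + (2i+1) T_{i,j-1}, with T_{0,j} = T_{j,0} = 1."""
    if i == 0 or j == 0:
        return 1
    return (2 * j + 1) * T(i - 1, j) + (2 * i + 1) * T(i, j - 1)

def J(k):
    """J(2i) = S_{i,i} and J(2i+1) = T_{i,i}."""
    half = k // 2
    return S(half, half) if k % 2 == 0 else T(half, half)

for k in range(2, 13):
    print(k, J(k))
```

The printed integers can be compared with the middle column of Table IX.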



[Figure 9: two diamond-shaped arrays of numbers, labeled EVEN CASE and 
ODD CASE, with circled multipliers on the diagonal lines.] 

Fig. 9 - Combinatoric derivation of certain Eulerian (diamond) numbers. 
The number at any vertex V is obtained by considering any one path from 
the top vertex to V, multiplying the circled numbers encountered in this 
path, and summing the results obtained over all possible downward paths 
from the top vertex to V. In particular, the values on the vertical 
diagonal (of the diamond) are the values of J(k) in Table IX. It is 
interesting to note that the sum of all the uncircled numbers in the mth 
row is 2^{m-1}(m − 1)! for the odd case and m! for the even case. This is 
shown above for m = 1, 2, 3, 4, 5 and would hold for all m if this pattern 
were continued indefinitely. The circled numbers are obtained by 
numbering the parallel diagonal lines starting with one at the "top," 
using all positive integers in the even case and only odd integers in the 
odd case. 



158 THE BELL SYSTEM TECHNICAL JOURNAL, JANUARY 1958 

The integers J(k) arise in connection with combinatorial problems. 
As an example for even k, suppose we draw at random m balls in succession 
(without replacement) from an urn containing m balls marked 1, 2, ..., m. 
Let X denote the number of times that the observed number increases, 
always counting the first draw as an increase. Then it can be shown that 

\[ P\{X = j\} = S_{j,\,m-j+1} / m! \quad (j = 1, 2, \ldots, m), \tag{A25} \]

i.e., the mth row of the left diamond in Fig. 9 divided by the sum m! of 
that row gives the elementary probability distribution of X. 
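
For small m the distribution (A25) can be confirmed by listing all m! orders of drawing. The Python sketch below is illustrative; the function names are hypothetical, drawing is assumed to be without replacement, and the S_{i,j} are generated from the recurrence (A21) with boundary values S_{1,j} = S_{j,1} = 1.

```python
from collections import Counter
from fractions import Fraction
from functools import lru_cache
from itertools import permutations
from math import factorial

@lru_cache(maxsize=None)
def S(i, j):
    """S_{i,j} from the recurrence (A21) with boundary values equal to 1."""
    if i == 1 or j == 1:
        return 1
    return j * S(i - 1, j) + i * S(i, j - 1)

def increase_distribution(m):
    """Exact distribution of X, the number of increases, over all m! draws."""
    counts = Counter()
    for draw in permutations(range(1, m + 1)):
        # The first draw counts as an increase; then count each rise.
        x = 1 + sum(b > a for a, b in zip(draw, draw[1:]))
        counts[x] += 1
    return {j: Fraction(counts[j], factorial(m)) for j in range(1, m + 1)}

m = 5
enumerated = increase_distribution(m)
formula = {j: Fraction(S(j, m - j + 1), factorial(m)) for j in range(1, m + 1)}
print(enumerated == formula)
```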

The problem of computing (A25) also arose in the work of V. H. 
Moore and W. A. Wallis (Reference 4) and P. A. MacMahon (Reference 3), 
who referred to it as Simon Newcomb's problem. J. Riordan (Reference 5) 
has studied the numbers J(k) for even k and Carlitz and Riordan 
(Reference 5) call them Eulerian numbers (to be distinguished from the 
classical Euler numbers); an explicit formula as well as a generating 
function appears in these papers. The S_{i,j} are related to the Eulerian 
numbers A_{n,k} (defined in Reference 5) by S_{i,j} = A_{i+j-1,j}. 

Explicit expressions for J(k) for odd and even k are obtainable from 
(A22), (A24) and the more general results 

\[ S_{i,j} = \sum_{\alpha=0}^{j} (-1)^{\alpha} \binom{i+j}{\alpha} (j-\alpha)^{i+j-1} \tag{A26} \]

\[ T_{i,j} = \sum_{\alpha=0}^{j} (-1)^{\alpha} \binom{i+j+1}{\alpha} [2(j-\alpha)+1]^{i+j} \tag{A27} \]

due to D. Slepian (Reference 7). It is easily shown that these formulae 
satisfy the corresponding recurrence relations as well as the boundary 
conditions. By an induction and symmetry argument applied to (A21) and 
(A23) and the boundary conditions it is easy to prove that 

\[ S_{i,j} = S_{j,i} \quad\text{and}\quad T_{i,j} = T_{j,i}. \tag{A28} \]
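
A short numerical comparison of the explicit sums with the recurrences, and of the symmetry (A28), can be made as follows. This Python sketch is illustrative only; the function names are hypothetical, and the summation limits (alpha from 0 to j) are an assumption about the explicit formulae, chosen so that every base j - alpha or 2(j - alpha) + 1 stays non-negative.

```python
from functools import lru_cache
from math import comb

@lru_cache(maxsize=None)
def S_rec(i, j):
    """Recurrence (A21) with S_{1,j} = S_{j,1} = 1."""
    if i == 1 or j == 1:
        return 1
    return j * S_rec(i - 1, j) + i * S_rec(i, j - 1)

@lru_cache(maxsize=None)
def T_rec(i, j):
    """Recurrence (A23) with T_{0,j} = T_{j,0} = 1."""
    if i == 0 or j == 0:
        return 1
    return (2 * j + 1) * T_rec(i - 1, j) + (2 * i + 1) * T_rec(i, j - 1)

def S_explicit(i, j):
    """Alternating sum for S_{i,j}; the alpha = j term vanishes."""
    return sum((-1) ** al * comb(i + j, al) * (j - al) ** (i + j - 1)
               for al in range(j + 1))

def T_explicit(i, j):
    """Alternating sum for T_{i,j}."""
    return sum((-1) ** al * comb(i + j + 1, al) * (2 * (j - al) + 1) ** (i + j)
               for al in range(j + 1))

ok_S = all(S_explicit(i, j) == S_rec(i, j) == S_rec(j, i)
           for i in range(1, 8) for j in range(1, 8))
ok_T = all(T_explicit(i, j) == T_rec(i, j) == T_rec(j, i)
           for i in range(0, 7) for j in range(0, 7))
print(ok_S, ok_T)
```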

Substituting (A26) and (A27) in (A28) gives rise to interesting, non- 
trivial identities. For completeness we also give the generating functions 
derived by D. Slepian 7 

■ *&£ < A29 > 
The final result for the content I(k) of ϑ can, using the above, be 
written as a single expression (A32) for all k, where [x] denotes the 
largest integer not greater than x. It has been pointed out by J. W. Tukey 
that (A32) can also be obtained by probabilistic considerations and that 
it appears in Laplace's "Theorie Analytique" (Book 2, page 260). 

APPENDIX IV 

Remarks on the Confidence Bands 

It should be remarked that other assumptions on the true, unknown 
distribution can be used in conjunction with the confidence bands 
obtained in Section VI. It has been pointed out by J. W. Tukey, for 
example, that in the case of the first diagram in Fig. 8 the experimenter 
might be willing to assume that the true distribution is unimodal and 
that the mode x_M is such that x_M ≤ x_64. Then on purely geometrical 
considerations it can be shown that the confidence band can be modified 
as shown in the first diagram of Fig. 10. Briefly, if the true 
distribution enters any one of the three deleted triangles with any slope 
s then in order to get out again without leaving the confidence band the 
slope must get larger than s. But this contradicts the assumption that 
the density steadily decreases after x_M. 

Similarly, with the same problem, if the experimenter assumes that 
the true distribution is unimodal and that x 13 ^ x m ^ x m then the first 
diagram of Fig. 8 can be modified as in the second diagram of Fig. 10. 
The assumption of unimodality is reasonable in many different practical 
applications but has not often been utilized in statistical techniques. 

It is possible to formulate a problem for fixed P* and n which requires 
the determination of that k which makes the maximum (or some average) 
vertical width of the confidence bands as small as possible. For example, 
for P* = 0.85 and n = 240 the value k = 10 minimizes the maximum 
vertical width. It should be pointed out that if the experimenter's prin- 
cipal interest is in finding confidence bands with small vertical widths 
then this procedure appears to be quite inefficient compared with that 
based on the Kolmogorov statistic. 

A proper comparison is difficult since the nominal P* is a lower bound 
and not the correct value of the confidence level associated with the 
proposed confidence bands. As mentioned in the body of the paper the 
development of a confidence band is just a by-product of the main theme 
of this paper which is the representativeness of the sample. 

Fig. 10 - Modified confidence bands which include the true distribution 
function with confidence greater than P* = 0.85 for k = 10 and 
d_g* = 0.5. 



VII. CONCLUSION 

Definitions of representativeness and of degree of representativeness are 
given and tables are included which give the sample size required to 
guarantee with preassigned probability P* that a random sample will 
satisfy a condition of representativeness, the definition of which is 
agreed upon in advance. Thus, for experimenters who wish to know in 
advance how many observations will be needed for a distribution study, 
the problem has been given a precise nonparametric formulation and the 
solution has been found for some cases. 

This formulation also leads to confidence bounds on the unknown 
distribution after the observations are taken. Examples are given to illus- 
trate this. 

The tables for the case of pairwise disjoint, equi-probable and 
exhaustive cells may also prove to be useful for the problem of 
determining the sample size required to obtain simultaneous confidence 
limits (on a preassigned level P*) for all of the cell probabilities of a 
multinomial distribution. Further investigation is needed to state 
precisely the conditions under which these tables can be used for this 
related problem. 

VIII. ACKNOWLEDGEMENT 

The authors wish to acknowledge the encouragement of the Bell 
Laboratories staff at Allentown and various helpful suggestions and aid 
received from D. Slepian, S. Monro, J. W. Tukey, J. Riordan, T. W. 
Anderson and others. The assistance of Miss Phyllis Groll in checking 
tables and proofreading is gratefully acknowledged. 

REFERENCES 

1. Birnbaum, Z. W., Numerical Tabulation of the Distribution of 
   Kolmogorov's Statistic for Finite Sample Size, Journal of the American 
   Statistical Association, 47, Sept., 1952. 

2. Loeve, M., Probability Theory, D. Van Nostrand Company, New York, 1955. 

3. MacMahon, P. A., Combinatory Analysis, Vol. 1, p. 187, Cambridge, 1915. 

4. Moore, V. H. and Wallis, W. A., Time Series Significance Tests Based on 
   Signs of Differences, Journal of the American Statistical Association, 
   38, June, 1943. 

5. Riordan, J., Triangular Permutation Numbers, Proc. of the Amer. Math. 
   Soc., 2, June, 1951; Carlitz, L. and Riordan, J., Congruences for 
   Eulerian Numbers, Duke Math. J., 20, September, 1953. 

6. Schoute, Analytical Treatment of the Polytopes Regularly Derived from 
   the Regular Polytopes (IV), Verhandelingen der Koninklijke Akademie van 
   Wetenschappen te Amsterdam (eerste sectie), 11.5, pp. 73-108, 1913. 

7. Slepian, D., On the Volume of Certain Polytopes, internal B.T.L. 
   memorandum, April 11, 1956. 

8. Sommerville, D. M. Y., An Introduction to the Geometry of N Dimensions, 
   p. 125, E. P. Dutton and Company, N. Y., 1929. 



